Overview and Background


Since the enactment of the No Child Left Behind Act of 2001 (NCLB), and
particularly in relation to Title I and Title III, assessments for special
student populations have come under increased scrutiny. Ensuring the
technical adequacy of assessments for English language learners (ELLs) and
students with disabilities (SWDs) is critical, given the high stakes
associated with the outcomes of these measures. State departments of
education, policymakers, and test developers have attempted several
strategies to satisfy the NCLB requirements for valid, reliable, and
accessible assessments for special student populations. In practice,
however, several challenges affecting technical adequacy have impacted
states’ abilities to meet the needs of their special student populations
(Rabinowitz, Ananda, & Bell, 2005). These challenges include demonstrating
technical adequacy and comparability of assessments for special student
populations and their general education counterparts and ensuring
consistency of meaning across assessments for general population students
and students with special needs (Abedi, 2004; Bowe, 2000; Center for Equity
and Excellence in Education, 2005; Rochester, 2004).

Recent research has shown that the technical adequacy of assessments for
special student populations is relatively undeveloped compared to that of
their general education counterparts; that is, the technical evidence
provided and the methods by which this evidence is established do not
necessarily account for the unique characteristics of special needs
populations (i.e., ELLs, SWDs) or the assessed domains (e.g., English
language proficiency) (Rabinowitz & Sato, 2005). While there is substantial
overlap between the procedures and criteria found appropriate and essential
for determining the technical adequacy of special population assessments
versus their general education counterparts, there is not complete overlap.
Some technical criteria do not transfer directly or are less critical when
applied to a technical review of assessments for special student populations
(Rabinowitz & Sato, 2005).

Evaluation of Technical Evidence of Assessments for Special Student Populations

This document presents an ongoing evaluation conducted by the Special
Populations Strand of the Assessment and Accountability Comprehensive Center
(AACC) and is intended to inform developers and consumers of assessments for
special student populations (ELLs and SWDs). The evaluation focuses on the
technical adequacy (i.e., validity, reliability, freedom from bias) of
evidence related to assessments used to meet relevant Title I and Title III
requirements under NCLB. The technical criteria used in this evaluation are
validated and are sensitive to the unique characteristics of special student
populations, the particular purposes of the assessments, and the stage of
development and maturity of the assessments.¹

According to the Standards for Educational and Psychological Testing (AERA,
APA, & NCME, 1999), there are multiple elements that contribute to the
technical adequacy of high-quality assessments. Until recently, the range of
sources of technical evidence available had not been aggregated and
rigorously evaluated in any methodical fashion. Therefore, the AACC is
applying a comprehensive set of validated criteria based on those developed
by Rabinowitz and Sato (2005) to evaluate the technical evidence associated
with assessments for special student populations (see Appendix A for a list
of the technical criteria and Appendix B for the operational definitions of
the criteria). As mentioned previously, these criteria are sensitive to the
unique characteristics of special student populations, the particular
purposes of the assessments, and the stage of development and maturity of
the assessments. Using the results of this evaluation, assessment developers
and consumers will be able to gauge the technical adequacy of the
assessments they are using or are considering for use with their special
student populations and identify appropriate and necessary next steps for
ensuring the assessments’ validity and, ultimately, the defensibility of the
assessment and its results.

1 The first set of assessments evaluated vis-a-vis the technical criteria is
English language proficiency assessments for ELLs. For evaluation summaries
of the assessments reviewed, please go to http://www.aacompcenter.org and
click on the Special Populations link. Assessments for SWDs will be reviewed
subsequently.

As mentioned previously, this evaluation builds on research conducted by
Rabinowitz and Sato (2005) that examined the technical adequacy of evidence
associated with ELL assessment types (i.e., consortium-developed,
custom-developed [publisher], and not custom-developed [publisher]). The
AACC evaluations presented here are reviews of technical evidence related to
specific ELL assessments (rather than assessment types) in order to better
inform states about the technical adequacy (i.e., validity, reliability,
freedom from bias) of the assessments available for use with their ELLs
(technical reviews of assessments for SWDs will be made available as this
evaluation work continues). As suggested by the research conducted by
Rabinowitz and Sato (2005), many assessments for special student populations
currently used for accountability purposes should be considered works in
progress because they have been developed since the advent of NCLB, making
current technical evidence preliminary, at best.

The Evaluation of Technical Evidence: Materials and Procedures

Materials 

As mentioned previously, the first set of assessments evaluated was for
ELLs. AACC staff compiled a state-by-state listing of English language
proficiency (ELP) assessments used for ELL² students and collected available
technical documentation related to specific assessments. Generally, the ELP
assessments fell into three main categories: consortium-developed;
custom-developed (publisher); and not custom-developed (publisher).
Attention was focused on those more formal assessment efforts with
sufficient resources behind them to likely meet technical adequacy
requirements.

Available technical documentation published by the test developers was
collected for each assessment (e.g., technical manuals and reports).
Additional documents were identified for review by using the following
resources:

• Experts in the areas of assessment, accountability, and ELLs;

• State-level contacts;

• Journal articles, technical bulletins, and reports;

• Conference presentations;

• Manuals and other documentation from state and national assessment
  programs; and

• Reliability and validity studies from relevant assessment programs.

2 For the purpose of considering materials for inclusion in this study, the
category of English language learner (ELL) was defined in its broadest sense
because part of this analysis involved the evaluation of the adequacy of the
target population definition and, subsequently, implications for the
technical adequacy of evidence presented, as it relates to this population
(e.g., sampling and bias).

Criteria 

Technical evidence was evaluated against a comprehensive set of criteria
based on widely known and respected standards (e.g., What Works
Clearinghouse, 2004a, 2004b; AERA, APA, & NCME Joint Standards, 1999; Becker
& Camilli, 2004) as well as principles underlying rigorous,
scientifically-based research (see Appendix A for a list of the technical
criteria and Appendix B for the operational definitions of the criteria).
Generally, the criteria are related to assessment validity, reliability, and
freedom from bias. A group of experts that collectively had experience in
large-scale test development, psychometrics, English language development,
the English language learner population, and technical
assistance/consultation to state departments of education was convened to
review and validate the criteria and their operational definitions.³
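
As an illustration only, the comparison of a test's documented evidence with the tiered criteria in Appendix A can be pictured as a simple checklist. The sketch below is not the AACC protocol: the dictionary layout, the function name `summarize_evidence`, and the lower-cased element labels are assumptions made for this example, and only an excerpt of the Appendix A elements is included.

```python
# Excerpt of Appendix A elements. The layout and labels are assumptions made
# for this sketch, not part of the AACC evaluation protocol.
CRITERIA = {
    "construct validity": {
        "tier1": ["test purpose", "population/classification",
                  "theoretical foundation/framework", "standardization"],
        "tier2": ["universal design", "readability",
                  "equivalence/comparability", "fidelity", "accommodation"],
    },
    "criterion validity": {
        "tier1": ["cross tabulations or Pearson correlation"],
        "tier2": [],
    },
}


def summarize_evidence(documented):
    """Compare documented evidence with the tiered criteria.

    documented: dict mapping a criterion type to the set of elements for
    which evidence was found in the technical documentation.
    """
    summary = {}
    for criterion_type, tiers in CRITERIA.items():
        found = documented.get(criterion_type, set())
        summary[criterion_type] = {
            # Tier 1 elements ought to be part of a test's body of evidence.
            "missing_tier1": [e for e in tiers["tier1"] if e not in found],
            # Tier 2 elements may or may not be available.
            "available_tier2": [e for e in tiers["tier2"] if e in found],
        }
    return summary
```

A hypothetical call such as `summarize_evidence({"construct validity": {"test purpose", "readability"}})` would list the Tier 1 elements still lacking evidence for each criterion type, which mirrors the "next steps" use of the evaluation described earlier.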

3 The technical criteria and operational definitions developed by Rabinowitz
and Sato (2005) were originally developed for research purposes. Therefore,
with the consent of the authors, the AACC has made refinements to both the
technical criteria and operational definitions in order to make this
information more practicable to states and test publishers.

Use of appropriate and technically defensible assessments is a key to reform
in the NCLB era.


Analysts 

The evaluation necessarily involved both technical and content perspectives.
Analysts possessed experience in large-scale test development, and
collectively, they possessed training in measurement, statistics, and
applied linguistics and had English language development (ELD) expertise as
well as experience teaching ELLs. The expertise of the analysts was critical
because a technical understanding of evidence related to validity,
reliability, and bias was not sufficient to determine whether the evidence
was fully appropriate for the ELL population and purposes of the assessments
(i.e., the target population is much more narrowly defined than that of a
typical statewide achievement test, and the ELP domain is necessarily
assessed in a manner that differs from academic content domains). For
example, if a section within a document described Differential Item
Functioning (DIF), then from a technical perspective (measurement,
statistics) the analyst would determine whether the DIF analyses were
appropriately implemented (e.g., sufficient sample sizes, appropriate
significance tests). From a content perspective (applied linguistics, ELD),
the analyst would determine whether the application of DIF from traditional
testing programs was fully appropriate for the ELL population and purposes
of the assessments that are the focus of this technical evaluation.
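
To make the technical side of a DIF review concrete, the sketch below computes Mantel-Haenszel statistics for a single dichotomous item. It is offered as an illustration, not as the procedure used in any of the reviewed assessments: the source does not say which DIF method the developers applied, and the function name, the record layout, and the group labels "reference" and "focal" are assumptions. Examinees are matched on a score (for example, total test score), and the function returns the common odds ratio, the same value on the delta scale (-2.35 times its natural log), a continuity-corrected chi-square, and the group sizes an analyst would inspect when judging whether samples are sufficient.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records):
    """Mantel-Haenszel DIF statistics for one dichotomous item.

    records: iterable of (group, matching_score, correct), where group is
    "reference" or "focal" and correct is 0 or 1. (Hypothetical layout.)
    """
    # One 2x2 table per matching-score stratum: [n_correct, n_incorrect].
    strata = defaultdict(lambda: {"reference": [0, 0], "focal": [0, 0]})
    group_sizes = {"reference": 0, "focal": 0}
    for group, score, correct in records:
        strata[score][group][0 if correct else 1] += 1
        group_sizes[group] += 1

    num = den = 0.0                       # terms of the common odds ratio
    observed = expected = variance = 0.0  # terms of the MH chi-square
    for cells in strata.values():
        a, b = cells["reference"]  # reference group: correct, incorrect
        c, d = cells["focal"]      # focal group: correct, incorrect
        n = a + b + c + d
        if n < 2 or a + b == 0 or c + d == 0:
            continue  # stratum offers no between-group comparison
        num += a * d / n
        den += b * c / n
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))

    if num == 0 or den == 0 or variance == 0:
        return None  # statistics are undefined for these data
    odds_ratio = num / den
    chi_square = (abs(observed - expected) - 0.5) ** 2 / variance
    return {
        "odds_ratio": odds_ratio,
        "mh_d_dif": -2.35 * math.log(odds_ratio),
        "chi_square": chi_square,  # compare with 3.84 (1 df, alpha = .05)
        "group_sizes": group_sizes,
    }
```

A statistic like this addresses only the technical question. Whether matching on total score is meaningful for a narrowly defined ELL population, or for an ELP domain, remains the content judgment described above.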

Analysts were provided training on the technical criteria and evaluation
protocol. As necessary, analysts created decision rules, which are
guidelines for the application of the criteria that help to ensure accurate
and consistent application throughout the evaluation process. Training ended
when analysts could ensure consistent and accurate understanding and
application of the criteria.

Procedure 

Following their training on the technical criteria and evaluation protocol,
analysts evaluated the available documentation for selected ELP assessments.
A preliminary summary describing the technical evidence associated with an
assessment vis-a-vis the technical criteria was then written. This summary
was sent to the test developer in order to solicit any comments and
additional technical evidence the developer might have that would further
inform the evaluation of the assessment’s technical quality.

If a test developer submitted additional documentation for review, analysts
would evaluate the documentation for consideration in the assessment’s body
of evidence for validity, reliability, and freedom from bias. If a test
developer chose not to submit additional documentation or did not provide
comment on the preliminary summary, then the test developer’s response (or
lack thereof) was noted in the body of evidence summary for the assessment.

Test developers were provided the opportunity to review and comment on the
body of evidence lists and summaries resulting from AACC evaluations. Once a
test developer had the opportunity to provide additional documentation for
consideration and to comment on the body of evidence summary created for its
assessment, the final summary was prepared for posting on the AACC website.

As noted previously, this evaluation is ongoing; therefore, additional
technical evaluations of assessments will be conducted and summaries will be
posted as they are completed. Currently, the AACC evaluations have focused
on ELP assessments for ELLs. As the work continues, technical evaluations
will include assessments for SWDs.

States and other consumers of assessments for special student populations
[...] reform in the NCLB era. Developers of assessments for special student
populations are encouraged to step up their efforts to ensure the technical
adequacy [...] throughout the nation.

This need will only grow as the percentage of students who are English
language learners and students with disabilities continues to increase
throughout the nation.

Evaluation summaries of assessments reviewed can be found at
http://www.aacompcenter.org (see Special Populations page).


For information about the AACC, visit http://www.aacompcenter.org.


For questions about these technical criteria, contact 
Edynn Sato, Ph.D., Director, Special Populations, 
Assessment and Accountability Comprehensive Center, 
WestEd (email: esato@wested.org).

Please cite as: Sato, E., Rabinowitz, S., Worth, P.,
Gallagher, C., Lagunoff, R., & Crane, E. (2007).
Evaluation of the Technical Evidence of Assessments
for Special Student Populations. (Assessment and
Accountability Comprehensive Center report).
San Francisco: WestEd.


The contents of this report were developed under a grant from the Department
of Education. However, those contents do not necessarily represent the
policy of the Department of Education, and you should not assume endorsement
by the Federal Government.

Appendix A

Assessments for English Language Learners:

Technical Adequacy Criteria — Tiers 

Notes: Types of validity, reliability, and bias and sensitivity evidence associated with various phases of test development are 
presented in the table below. Tier 1 elements ought to be part of a test’s body of evidence. Tier 2 elements are important, 
but may or may not be available, depending on the nature and maturity of a particular test. 


Type                    Phase                           Tier 1 Elements                     Tier 2 Elements

Validity

Construct validity      Test design/development (10)    Test purpose                        Universal design
                                                        Population/classification           Readability
                                                        Theoretical foundation/framework    Multi-trait/multi-method/subtest
                                                                                              inter-correlation
                                                        Standardization                     Equivalence/comparability
                                                                                            Fidelity
                                                                                            Accommodation

Content validity        Test design/development (14)    Alignment (items-to-standards)      Linkage (items-to-standards;
                                                                                              standards-to-standards)
                                                        Expert judgment                     Structural equation modeling
                                                        Item fit                            t-tests
                                                          IRT/item fit OR
                                                          p-values/point biserials
                                                        Test blueprint                      ANOVA
                                                        Alignment (test form-to-blueprint)  Factor analysis
                                                        Test fit                            Linking/equating*
                                                          IRT/test fit OR
                                                          Descriptive statistics
                        Field testing (3)               Sampling OR norming                 Blueprint
                        Scoring (4)                     Scale                               Rubric
                                                        Standard setting*                   Training of scorers/scoring protocol

Criterion validity      Test design/development (2)     Cross tabulations OR Pearson correlation

Consequential validity  Test design/development (1)     Use of results
                        Reporting (4)                   N                                   Effect size
                                                        Central tendency/variation
                                                        Reporting category
                        Security (1)                    Protocols

*Placement in this column is related to the nature and maturity of the instrument.

Reliability

Reliability             Test design and development     Internal consistency                Test length/power estimates
                        (11+1)                            KR-21 OR
                                                          Coefficient alpha OR
                                                          Split-half
                                                        Standard error of measurement/      Generalizability
                                                          confidence intervals                G-coefficient
                                                        Test-retest OR alternate form*      Classification consistency
                                                                                              Percent correspondence OR
                                                                                              Correlation coefficient OR
                                                                                              Classification error
                        Scoring (2)                     Inter-rater reliability
                                                          Percent correspondence OR
                                                          Correlation (kappa)

Bias and Sensitivity

Bias and sensitivity    Test design/development (13)    Expert review:                      DIF analyses:
                                                          Linguistic                          Linguistic
                                                          Ethnicity/race                      Ethnicity/race
                                                          Cultural/religious
                                                          Geographic                          Geographic
                                                          SES                                 SES
                                                          Disability                          Disability
                                                          Gender                              Gender

*Placement in this column is related to the nature and maturity of the instrument.
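
For readers less familiar with the reliability elements named in the table, the following sketch shows how several of them are conventionally computed: coefficient alpha, KR-21, the standard error of measurement, and Cohen's kappa for inter-rater agreement. These are the standard textbook formulas, not a description of how any reviewed assessment calculated its evidence; the function names and data layouts are assumptions chosen for this example, and population variances are used throughout.

```python
from statistics import mean, pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha. item_scores: one row per examinee, one column per item."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def kr21(total_scores, n_items):
    """KR-21 from total scores on n_items dichotomous items."""
    m, var = mean(total_scores), pvariance(total_scores)
    return n_items / (n_items - 1) * (1 - m * (n_items - m) / (n_items * var))


def standard_error_of_measurement(total_scores, reliability):
    """SEM = score standard deviation * sqrt(1 - reliability)."""
    return pvariance(total_scores) ** 0.5 * (1 - reliability) ** 0.5


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' category assignments."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    chance = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                 for c in categories)
    # Undefined (division by zero) when chance agreement equals 1.
    return (observed - chance) / (1 - chance)
```

As with DIF, a coefficient computed this way is only part of the evidence. Whether the sample behind it reflects the intended ELL or SWD population is a separate question that the criteria above also address.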